Abstract

Recent work has shown the necessity of considering an 
attacker's background knowledge when reasoning about 
privacy in data publishing. However, in practice, the data 
publisher does not know what background knowledge the 
attacker possesses. Thus, it is important to consider the 
worst-case. In this paper, we initiate a formal study of 
worst-case background knowledge. We propose a language 
that can express any background knowledge about the data. 
We provide a polynomial time algorithm to measure the 
amount of disclosure of sensitive information in the worst 
case, given that the attacker has at most k pieces of infor- 
mation in this language. We also provide a method to effi- 
ciently sanitize the data so that the amount of disclosure in 
the worst case is less than a specified threshold. 



1. Introduction 

We consider the following situation. A data publisher 
(such as a hospital) has collected useful information about 
a group of individuals (such as patient records that would 
help medical researchers) and would like to publish this data 
while preserving the privacy of the individuals involved. 
The information is stored as a table (as in Figure 1) where 
each record corresponds to a unique individual and contains 
a sensitive attribute (e.g., disease) and some non-sensitive 
attributes (e.g., address, gender, age) that might be learned 
using externally available data (e.g., phone books, birth 
records). The data publisher would like to limit the disclo- 
sure of the sensitive values of the individuals in order to de- 
fend against an attacker who possibly already knows some 
facts about the table. Our goal in this paper is to quantify 
the precise effect of background knowledge possessed by 
an attacker on the amount of disclosure and to provide al- 
gorithms to check and ensure that the amount of disclosure 
is less than a specified threshold. 

The problem we solve is of real and practical im- 
portance; an egregious example of a privacy breach was 



the discovery of the medical records of the Governor of 
Massachusetts from an easily accessible and supposedly 
anonymized dataset. All that was needed was to link it to 
voter registration records [32]. To defend against such attacks,
Samarati and Sweeney [29] introduced a privacy criterion called
k-anonymity, which requires that each individual be indistinguishable
(with respect to the non-sensitive attributes) from at least k − 1
others. This is done by grouping individuals into buckets of size at
least k, and then permuting the sensitive values in each bucket and
sufficiently masking their externally observable non-sensitive attributes.
Figure 2 depicts a table that is a 5-anonymous version of the 
table in Figure 1 . Figure 3 depicts the permutation of sensi- 
tive values that was used to construct this table. 

However, k-anonymity does not adequately protect the
privacy of an individual;¹ for example, when all individuals
in a bucket have the same disease, the disease of the indi- 
viduals in that bucket is disclosed regardless of the bucket 
size. Even when there are multiple diseases in the same 
bucket, the frequencies of the diseases in the bucket still 
matter when an attacker has some background knowledge 
about the particular individuals in the table. Suppose the 
data publisher has published the 5-anonymous table as depicted
in Figure 2. Consider an attacker Alice who would
like to learn the diseases of all her friends and neighbors. 
One of her neighbors is Ed, a 27 year-old male living in 
Ithaca (zip code 14850). Alice knows that Ed is in the hos- 
pital that published the anonymized dataset in Figure 3, and 
she wants to find out Ed's disease. Using her knowledge 
of Ed's age, gender, and zip-code, Alice can identify the 
bucket in the anonymized table that Ed belongs to (namely, 
the first bucket). Alice does not know which disease listed 
within that bucket is Ed's since the sensitive values were 
permuted. Therefore, without additional knowledge, Al- 
ice's estimate of the probability that Ed has lung cancer is 
2/5. But suppose Alice knows that Ed had mumps as a child 
and is therefore extremely unlikely to get it again. After ruling
out this possibility, the probability that Ed has lung cancer
increases to 1/2. Now, if Alice also somehow discovers
that Ed does not have flu, then the fact that he has lung cancer
becomes certain. Here, two pieces of knowledge of the
form "Ed does not have X" were enough to fully disclose
Ed's disease. To guard against this, Machanavajjhala et al.
[24] proposed a privacy criterion called ℓ-diversity that ensures
that it takes at least ℓ − 1 such pieces of information
to sufficiently disclose the sensitive value of any individual.
The main idea is to require that, for each bucket, the ℓ most
frequent sensitive values are roughly equi-probable.

¹ Indeed, the definition of k-anonymity does not even mention the
sensitive attribute!

Figure 1. Original table

Figure 2. 5-anonymous table

Figure 3. Bucketized table

ℓ-diversity focuses on one type of background knowledge:
knowledge of the form "individual X does not have
sensitive value Y". But an attacker might well have other
types of background knowledge. For example, suppose Alice
lives across the street from a married couple, Charlie and
Hannah, who were both taken to the hospital. Once again,
using her knowledge of their genders, ages and zip-codes,
Alice can identify the buckets Charlie and Hannah belong
to. Without additional background knowledge, Alice thinks
that Charlie has the flu with probability 2/5. But suppose
that Alice knows that Hannah has had a flu shot recently but
Charlie has not. Believing Hannah's immunity to the flu to
be much stronger than Charlie's and knowing that they live
together, Alice deduces that if Hannah has succumbed to the
flu then it is extremely likely that Charlie has as well. This
knowledge allows her to update her probability that Charlie
has the flu to 10/19. We show how these probabilities are
computed in Section 3. ℓ-diversity does not guard against
the type of background knowledge in this example.

It is thus clear that we need a more general-purpose 
framework that can capture knowledge of any property of 
the underlying table that an attacker might know. More- 
over, unlike in the two examples above where we knew Al- 
ice's background knowledge, we will not assume that we 
know exactly what the attacker knows. We therefore take 
the following approach. In Section 2, we propose a lan- 
guage that is expressive enough to capture any property of 
the sensitive values in a table. This language enables us to 
decompose background knowledge into basic units of in- 
formation. Then, given an anonymized version of the table, 



we can quantify the worst-case disclosure risk posed by an 
attacker with k such units of information; k can be thought 
of as a bound on the power of an attacker. In Section 3,
we show how to efficiently preserve privacy by ensuring 
that the worst-case (i.e., maximum) disclosure for any k 
pieces of information is less than a specified threshold. Furthermore,
we show how to integrate our techniques into existing
frameworks to find a "minimally sanitized" table for which 
the maximum disclosure is less than a specified threshold. 
We present experiments in Section 4, related work in Sec- 
tion 5, and we conclude in Section 6. 

To the best of our knowledge, this is the first such formal 
analysis of the effect of unknown background knowledge 
on the disclosure of sensitive information. 

2. Framework 

We begin by modeling the data publishing situation formally.
Let $P$ be a (finite) set of people. For each $p \in P$, we
associate a tuple $t_p$ which has one sensitive attribute $S$ (e.g.,
disease) with finite domain and one or more non-sensitive
attributes. We overload notation and use $S$ to represent
both the sensitive attribute and its domain. The data publisher
has a table $T$, which is a set of tuples corresponding
to a subset of $P$. The publisher would like to publish $T$ in a
form that protects the sensitive information of any individual
from an attacker with background knowledge that can
be expressed in a language $\mathcal{L}$. (We propose such a language
to express background knowledge in Section 2.2.)

2.1. Bucketization

We first need to carefully describe how the published 
data is constructed from the underlying table if we are to 
correctly interpret this published data. That is, we need to 
specify a sanitization method. We briefly describe two pop- 
ular sanitization methods. 

• The first, which we term bucketization [34], is to par- 
tition the tuples in T into buckets, and then to sepa- 
rate the sensitive attribute from the non-sensitive ones 



by randomly permuting the sensitive attribute values 
within each bucket. The sanitized data then consists of 
the buckets with permuted sensitive values. 

• The second sanitization technique is full-domain gen- 
eralization [32], where we coarsen the non-sensitive 
attribute domains. The sanitized data consists of the 
coarsened table along with the generalization used. Note
that, unlike bucketization, the exact values of the non- 
sensitive attributes are not released; only the coarsened 
values are released. 

Note that if the attacker knows the set of people in the table 
and their non-sensitive values, then full-domain generaliza- 
tion and bucketization are equivalent. In this paper, we use 
bucketization as the method of constructing the published 
data from the original table T, although all our results hold 
for full-domain generalization as well. We plan to extend 
our algorithms to work for other sanitization techniques, 
such as data swapping [10] (which, like bucketization, also 
permutes the sensitive values, but in more complex ways) 
and suppression [29], in the future. 

We now specify our notion of bucketization more formally.
Given a table $T$, we partition the tuples into buckets
(i.e., horizontally partition the table $T$ according to some
scheme), and within each bucket, we apply an independent
random permutation to the column containing $S$-values.
The resulting set of buckets, denoted by $B$, is then published.
For example, if the underlying table $T$ is as depicted
in Figure 1, then the publisher might publish bucketization
$B$ as depicted in Figure 3. Of course, for added
privacy, the publisher can completely mask the identifying
attribute (Name) and may partially mask some of the other
non-sensitive attributes (Age, Sex, Zip).

For a bucket $b \in B$, we use the following notation.

$P_b$                     set of people $p \in P$ with tuples $t_p \in b$
$n_b$                     number of tuples in $b$
$n_b(s)$                  frequency of sensitive value $s \in S$ in $b$
$s_b^0, s_b^1, \ldots$    sensitive values in decreasing order
                          of frequency in $b$



2.2. Background Knowledge 

We pessimistically assume that the attacker has man- 
aged to obtain complete information about which individ- 
uals have records in the table, what their non-sensitive data 
is, and which buckets in the bucketization these records fall 
into. That is, we assume that the attacker knows $P_b$, the
set of people in bucket $b$, for each $b \in B$, and knows $t_p[X]$
for every person $p$ in the table and every non-sensitive attribute
$X$. We call this full identification information. One
way of obtaining identification information in practice is to 



link quasi-identifying non-sensitive attributes published in 
the bucketization (e.g., address, gender, age) with publicly
available data (e.g., phone directories, birth records) [32]. 

We make the standard random worlds assumption [6]: in
the absence of any further knowledge, we consider all tables
consistent with this bucketization to be equally likely.
That is, the probability of $t_p \in b$ having $s$ for its sensitive
attribute is $n_b(s)/n_b$ since each assignment of sensitive
attributes to tuples within a bucket is equally likely.
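The random worlds computation can be checked by brute force on a small bucket. The following is a minimal sketch, not the authors' code: the bucket contents (two lung cancer, two flu, one mumps) and the placeholder names `p1` to `p4` are assumptions chosen to agree with the probabilities quoted in the introduction, since the figures themselves are not reproduced here.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# Hypothetical bucket: Ed plus four placeholder people, and the multiset of
# sensitive values published for the bucket (assumed, not taken from Figure 3).
PEOPLE = ["Ed", "p1", "p2", "p3", "p4"]
VALUES = ["lung cancer", "lung cancer", "flu", "flu", "mumps"]

def probability(event, known=lambda world: True):
    """Pr(event | known), with every assignment within the bucket equally likely."""
    worlds = [dict(zip(PEOPLE, perm)) for perm in permutations(VALUES)]
    consistent = [w for w in worlds if known(w)]
    hits = sum(1 for w in consistent if event(w))
    return Fraction(hits, len(consistent))

def has_cancer(world):
    return world["Ed"] == "lung cancer"

# With no extra knowledge the probability equals n_b(s) / n_b.
prior = probability(has_cancer)
closed_form = Fraction(Counter(VALUES)["lung cancer"], len(VALUES))

# Conditioning on negated atoms, as in the Ed example.
no_mumps = probability(has_cancer, lambda w: w["Ed"] != "mumps")
no_mumps_no_flu = probability(has_cancer, lambda w: w["Ed"] not in ("mumps", "flu"))
```

Under these assumed bucket contents the enumeration agrees with the closed form $n_b(s)/n_b$, and conditioning on the two negations reproduces the progression described for Ed.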

We now need to consider knowledge beyond the identi- 
fication information that an attacker might possess. We as- 
sume that this further knowledge is the knowledge that the 
underlying table satisfies a given predicate on tables. That 
is, the attacker knows that the underlying table is among the 
set of tables satisfying the given predicate. This is a rather 
general assumption. For example, "the average age of heart 
disease patients in the table is 48 years" could be one such 
predicate. In order to quantify the power of such knowl- 
edge, we use the notion of a basic unit of knowledge, and 
we propose a language which consists of finite conjunctions 
of such basic units. Given full identification information, 
we desire that any predicate on tables be expressible using 
a conjunction of the basic units that we propose. We employ 
a very simple propositional syntax. 

Definition 1 (Atoms) An atom is a formula of the form
$t_p[S] = s$, for some value $s \in S$ and person $p \in P$ with
tuple $t_p \in T$. We say that atom $t_p[S] = s$ involves person $p$
and value $s$.

The interpretation of atoms is obvious: $t_{\text{Jack}}[\text{Disease}] = \text{flu}$
says that Jack's tuple has the value flu for the sensitive
attribute Disease.

The basic units of knowledge in our language are basic 
implications, defined below. 

Definition 2 (Basic implications) A basic implication is a
formula of the form

$$\Big(\bigwedge_{i \in [m]} A_i\Big) \rightarrow \Big(\bigvee_{j \in [n]} B_j\Big)$$

for some $m \geq 1$, $n \geq 1$ and atoms $A_i$, $B_j$, $i \in [m]$, $j \in [n]$
(note that we use the standard notation $[n]$ to denote the set
$\{0, \ldots, n-1\}$).

The fact that basic implications are a sufficiently expressive
"basic unit" of knowledge is made precise by the following
theorem.²

Theorem 3 (Completeness) Given full identification in- 
formation and any predicate on tables, one can express the 
knowledge that the underlying table satisfies the identifica- 
tion information and the given predicate using a finite con- 
junction of basic implications. 

² See [25] for proofs.



Hence we can model arbitrarily powerful attackers.³ Consider
an attacker who knows the disease of every person in
the table except for Bob. Then publishing any bucketization
will reveal Bob's disease. To avoid pathological and
unrealistic cases like this, we need to assume a bound on
the power of an attacker. We model attackers with bounded
power by limiting the number of basic implications that the
attacker knows. That is, the attacker knows a single formula
from language $\mathcal{L}^k_{\text{basic}}$ defined below.

Definition 4 $\mathcal{L}^k_{\text{basic}}$ is the language consisting of conjunctions
of $k$ basic implications. That is, $\mathcal{L}^k_{\text{basic}}$ consists of
formulas of the form $\bigwedge_{i \in [k]} \varphi_i$ where each $\varphi_i$ is a basic
implication.

k can thus be viewed as a bound on the attacker's power 
and can be increased to provide more conservative privacy 
guarantees. 

Note that our choice of basic implications for the "basic
unit" of our language has important consequences on
our assumptions about the attacker's power. In particular,
some properties of the underlying table might require
a large number of basic implications to express. Since basic
implications are essentially CNF clauses with at least
one negative atom, our language suffers from an exponential
blowup in the number of basic units required to express
arbitrary DNF formulas. Other choices of basic units may
lead to equally expressive languages while requiring
fewer basic units to express certain
natural properties, and we consider this an important direction
for future research. Nevertheless, many natural types of
background knowledge have succinct representations using
basic implications. For example, Alice's knowledge that "if
Hannah has the flu, then Charlie also has the flu" is simply
the basic implication

$$t_{\text{Hannah}}[\text{Disease}] = \text{flu} \rightarrow t_{\text{Charlie}}[\text{Disease}] = \text{flu}$$

And the knowledge that "Ed does not have flu" is

$$t_{\text{Ed}}[\text{Disease}] = \text{flu} \rightarrow t_{\text{Ed}}[\text{Disease}] = \text{ovarian cancer}$$

In general, we can represent $\neg(t[S] = s)$ by $(t[S] = s) \rightarrow
(t[S] = s')$ for any choice of $s' \neq s$ since each tuple has
exactly one sensitive attribute value.
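The language can be made concrete with a small data model. The sketch below is only an illustration of Definitions 1 and 2: the class names `Atom` and `BasicImplication`, the helper `negation`, and the representation of a table as a dictionary from person to sensitive value are all assumptions introduced here, not part of the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Table = Dict[str, str]  # hypothetical encoding: person -> sensitive value

@dataclass(frozen=True)
class Atom:
    """The formula t_person[S] = value."""
    person: str
    value: str

    def holds(self, table: Table) -> bool:
        return table[self.person] == self.value

@dataclass(frozen=True)
class BasicImplication:
    """(A_0 and ... and A_{m-1}) -> (B_0 or ... or B_{n-1}), with m, n >= 1."""
    antecedents: Tuple[Atom, ...]
    consequents: Tuple[Atom, ...]

    def holds(self, table: Table) -> bool:
        if all(a.holds(table) for a in self.antecedents):
            return any(b.holds(table) for b in self.consequents)
        return True  # vacuously true when some antecedent fails

def negation(person: str, value: str, other_value: str) -> BasicImplication:
    """Encode 'person does not have value' as (value -> other_value)."""
    assert other_value != value
    return BasicImplication((Atom(person, value),), (Atom(person, other_value),))

hannah_charlie = BasicImplication((Atom("Hannah", "flu"),), (Atom("Charlie", "flu"),))
ed_not_flu = negation("Ed", "flu", "ovarian cancer")
```

The negation encoding works because a tuple carries exactly one sensitive value: the implication can only be satisfied when its antecedent is false.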

Note that maintaining privacy when there is dependence 
between sensitive values, especially across buckets, is a 
problem that has not been previously addressed in the pri- 
vacy literature. The assignments of individuals to sensitive 
values in different buckets are not necessarily independent. 
As we saw in the example with Hannah and Charlie, fixing
a particular assignment in one bucket could affect what



assignments are possible in another. One of the contribu- 
tions of this paper is that we provide a polynomial time al- 
gorithm for computing the maximum disclosure even when 
the attacker has knowledge of such dependencies. 

2.3. Disclosure 

Having specified how the bucketization $B$ is constructed
from the underlying table $T$ and how an attacker's knowledge
about sensitive information can be expressed in language
$\mathcal{L}^k_{\text{basic}}$, we are now in a position to define our notion
of disclosure precisely.

Definition 5 (Disclosure risk) The disclosure risk of bucketization
$B$ with respect to background knowledge represented
by some formula $\varphi$ in language $\mathcal{L}^k_{\text{basic}}$ is

$$\max_{t_p \in T,\ s \in S} \Pr(t_p[S] = s \mid B \wedge \varphi)$$

That is, disclosure risk is the likelihood of the most highly
predicted sensitive attribute assignment.

Definition 6 (Maximum disclosure) The maximum disclosure
of bucketization $B$ with respect to language $\mathcal{L}^k_{\text{basic}}$
that expresses background knowledge is

$$\max_{\varphi \in \mathcal{L}^k_{\text{basic}},\ t_p \in T,\ s \in S} \Pr(t_p[S] = s \mid B \wedge \varphi)$$

³ A major shortcoming of the ℓ-diversity definition was that its choice of
"basic unit" of knowledge was essentially negated atoms (i.e., $\neg(t_p[S] = s)$)
which cannot capture all properties of the underlying table. For example,
negations cannot express basic implications in general.



By our assumptions in Section 2.2, we compute $\Pr(t_p[S] = s \mid B \wedge \varphi)$
by considering the set of all tables consistent with bucketization
$B$ and with background knowledge $\varphi$ and then taking
the fraction of those tables that satisfy $t_p[S] = s$. Using
this, the maximum disclosure of the bucketization in Figure
3 with respect to $\mathcal{L}^1_{\text{basic}}$ occurs when
$\varphi$ is $t_{p'}[S] = s' \rightarrow t_p[S] = s$ where $p$ is a person in the first
bucket, $p'$ is a person in the second bucket, and $s$ and $s'$ are
both flu. Our goal is to develop general techniques to:

1. efficiently calculate the maximum disclosure for any 
given bucketization, and 

2. efficiently find a "minimally sanitized" bucketization⁴
(or the set of all minimally sanitized bucketizations)
for which the maximum disclosure is below a specified
threshold (if any exist).
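The definition of disclosure risk can be evaluated directly, if inefficiently, by enumerating every table consistent with a bucketization. The sketch below does this for one cross-bucket implication of the Hannah and Charlie kind. The two buckets, the placeholder names, and the disease lists are hypothetical; they are chosen only so that flu has frequency 2/5 in each bucket.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations, product

# Hypothetical bucketization: (people, published multiset of sensitive values).
BUCKETS = [
    (["Charlie", "m1", "m2", "m3", "m4"],
     ["flu", "flu", "lung cancer", "lung cancer", "mumps"]),
    (["Hannah", "f1", "f2", "f3", "f4"],
     ["flu", "flu", "breast cancer", "ovarian cancer", "ovarian cancer"]),
]

def worlds():
    """Yield every table consistent with the bucketization (random worlds)."""
    per_bucket = [
        [dict(zip(people, perm)) for perm in permutations(values)]
        for people, values in BUCKETS
    ]
    for parts in product(*per_bucket):
        table = {}
        for part in parts:
            table.update(part)
        yield table

def consistent_tables(knowledge):
    """All tables that also satisfy the attacker's background knowledge."""
    return [t for t in worlds() if knowledge(t)]

def conditional(person, value, tables):
    """Pr(t_person[S] = value | B and knowledge) as a fraction of tables."""
    return Fraction(sum(t[person] == value for t in tables), len(tables))

def disclosure_risk(tables):
    """Probability of the most highly predicted (person, value) assignment."""
    best = Fraction(0)
    for person in tables[0]:
        top = Counter(t[person] for t in tables).most_common(1)[0][1]
        best = max(best, Fraction(top, len(tables)))
    return best

# Basic implication: Hannah has flu -> Charlie has flu.
implication = lambda t: t["Hannah"] != "flu" or t["Charlie"] == "flu"
tables = consistent_tables(implication)
charlie_flu = conditional("Charlie", "flu", tables)
risk = disclosure_risk(tables)
```

Enumeration of this kind grows factorially with bucket size, which is why the remainder of the paper looks for a closed form instead.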

3. Checking And Enforcing Privacy 

In Section 2.2, we defined basic implications as the "unit
of knowledge" and showed that this was a fully expressive
(in the presence of full identification information) and reasonable
choice. We now show how to efficiently calculate
and limit maximum disclosure against an attacker who has
full identification information and has up to $k$ additional
pieces of background knowledge (i.e., up to $k$ basic implications).
In order to do this, we will show in Theorem 9 that
there is a set of $k$ basic implications that maximizes disclosure
with respect to $\mathcal{L}^k_{\text{basic}}$. Furthermore, each such implication
has only one atom in the antecedent and one atom in
the consequent. This motivates the following definition.

⁴ We will make precise the notion of "minimally sanitized" in Section
3.4; we want "minimal sanitization" in order to preserve the utility of the
data.

Definition 7 (Simple implications) A simple implication
is a formula of the form $A \rightarrow B$ for some atoms $A$, $B$.

3.1. Hardness of computing disclosure risk 

Unfortunately, naive methods for computing the maxi- 
mum disclosure will not work - in fact, we can show that 
computing the disclosure risk of a given bucketization with 
respect to a given set of k simple implications is #P-hard. 
Note that k simple implications can be written in 2-CNF, 
for which satisfiability is easily checkable. Complexity is 
introduced in trying to simultaneously satisfy the k implica- 
tions and the given bucketization. In fact, deciding whether 
a given bucketization is consistent with a set of k simple 
implications is NP-complete. 

Theorem 8 Given as input bucketization $B$ and a conjunction
of simple implications $\varphi$, the problem of deciding if $B$
and $\varphi$ are both satisfiable by some table $T$ is NP-complete.
Moreover, given an atom $C$ as further input, the problem of
computing $\Pr(C \mid B \wedge \varphi)$ is #P-complete.

3.2. A special form for maximum disclosure 

It turns out that, despite the hardness results above, computing
the maximum disclosure with respect to language
$\mathcal{L}^k_{\text{basic}}$ can be done in polynomial time. The key insight
is summarized in Theorem 9.

Theorem 9 For any bucketization, there is a set of $k$ simple
implications, all sharing the same consequent, such that
the conjunction of these $k$ simple implications maximizes
disclosure with respect to $\mathcal{L}^k_{\text{basic}}$.

This insight is tremendously useful in devising a
polynomial-time dynamic programming algorithm for computing
the maximum disclosure with respect to $\mathcal{L}^k_{\text{basic}}$. It
allows us to restrict our attention to sets of $k$ simple implications
of the form $(t_{p_i}[S] = s_i) \rightarrow (t_p[S] = s)$ for people
$p, p_i \in P$, and values $s, s_i \in S$, $i \in [k]$. The proof of
Theorem 9 follows from the following two lemmas.

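Lemma 10 For any formulas $\psi$, $\varphi$, $\theta_i$, $\varphi_i$,

$$\Pr\Big(\varphi \,\Big|\, \psi \wedge \bigwedge_{i \in [k]} (\theta_i \rightarrow \varphi_i)\Big)
\leq \Pr\Big(\varphi \,\Big|\, \psi \wedge \bigwedge_{i \in [k]} (\theta_i \rightarrow \varphi)\Big)$$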



Starting with any set of $k$ basic implications that maximize
disclosure,⁵ Lemma 10 enables us to replace the consequent
in all the basic implications by a single common
atom (namely the atom corresponding to the highest predicted
assignment of sensitive value to an individual), while
still maintaining maximum disclosure.

Lemma 11 For any formulas $\psi$, $B$, $\theta_i$, where $B$ is an atom
and $\theta_i$ is a conjunction of atoms, there exist atoms $A_i$ such
that

$$\Pr\Big(B \,\Big|\, \psi \wedge \bigwedge_{i \in [k]} (\theta_i \rightarrow B)\Big)
\leq \Pr\Big(B \,\Big|\, \psi \wedge \bigwedge_{i \in [k]} (A_i \rightarrow B)\Big).$$

Next, Lemma 11 allows us to replace the antecedent of
each of the resulting implications by an atom (possibly with
a different atom for each implication), while still maintaining
maximum disclosure.

In both Lemmas 10 and 11, we use $\psi$ to represent the attacker's
knowledge about the bucketization $B$. However, it
is worthwhile pointing out that neither lemma places any restriction
on $\psi$ or on the underlying probability distribution.
This makes the results presented here extremely general and
powerful because they characterize the form of background
knowledge that maximizes disclosure risk for any form of
anonymization and for any additional background knowledge.

The main idea behind the proof of Lemma 10 (and also
Lemma 11) can be illustrated as follows. Consider a bucketization
$B$. Let $(t_{p_i}[S] = s_i) \rightarrow (t_{p'_i}[S] = s'_i)$, for
$i \in \{0, 1\}$, be two simple implications which maximize the
disclosure of $B$ with respect to $\mathcal{L}^2_{\text{basic}}$. For convenience,
we let $A_i$ denote the atom $t_{p_i}[S] = s_i$ and $B_i$ the atom
$t_{p'_i}[S] = s'_i$. Let $C$ be the atom $t_p[S] = s$ such that
$\Pr(C \mid B \wedge (\bigwedge_{i \in [2]} (A_i \rightarrow B_i)))$ is the maximum disclosure.

Now let us restrict our attention to the set of tables consistent
with $B$. Let $\mathcal{T}_1$ be the set of tables satisfying the simple
implications $A_0 \rightarrow B_0$ and $A_1 \rightarrow B_1$, and let $\mathcal{T}_2$ be the
set of tables satisfying $A_0 \rightarrow C$ and $A_1 \rightarrow C$. Figure 4 is a
diagrammatic representation of $\mathcal{T}_1$ and $\mathcal{T}_2$. Each row in the
truth table on the left (resp., right) in Figure 4 represents
a subset of $\mathcal{T}_1$ (resp., $\mathcal{T}_2$). The variables $a, b, c, d, e, f, g, h$ in
the left-most (resp., $a, b, d', f', h'$ in the right-most) column
represent the size of the corresponding set. For example,
the set of tables represented by the second row is the set of
tables that satisfy the atom $C$ but do not satisfy $A_0$ and $A_1$,
and the number of such tables is $b$.

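It is now clear from Figure 4 that the implications $A_0 \rightarrow C$
and $A_1 \rightarrow C$ also produce the maximum disclosure, as
follows. We have

$$\Pr\Big(C \,\Big|\, \bigwedge_{i \in [2]} (A_i \rightarrow B_i)\Big) = \frac{b+d+f+h}{a+b+c+d+e+f+g+h}
\quad\text{and}\quad
\Pr\Big(C \,\Big|\, \bigwedge_{i \in [2]} (A_i \rightarrow C)\Big) = \frac{b+d'+f'+h'}{a+b+d'+f'+h'}.$$

Also

$$\frac{b+d+f+h}{a+b+c+d+e+f+g+h} \leq \frac{b+d+f+h}{a+b+d+f+h} \leq \frac{b+d'+f'+h'}{a+b+d'+f'+h'}$$

since $d \leq d'$, $f \leq f'$, and $h \leq h'$. Thus
$\Pr(C \mid \bigwedge_{i \in [2]} (A_i \rightarrow B_i)) \leq \Pr(C \mid \bigwedge_{i \in [2]} (A_i \rightarrow C))$.

        ∧_{i∈[2]} (A_i → B_i)                 ∧_{i∈[2]} (A_i → C)
size   A_0  A_1  B_0  B_1  C           A_0  A_1  B_0  B_1  C    size
 a      0    0    *    *   0     =      0    0    *    *   0     a
 b      0    0    *    *   1     =      0    0    *    *   1     b
 c      0    1    *    1   0
 d      0    1    *    1   1     ⊆      0    1    *    *   1     d'
 e      1    0    1    *   0
 f      1    0    1    *   1     ⊆      1    0    *    *   1     f'
 g      1    1    1    1   0
 h      1    1    1    1   1     ⊆      1    1    *    *   1     h'

Figure 4. Truth tables

⁵ There always exists some set of $k$ basic implications that maximize
disclosure since there are only finitely many atoms and therefore $\mathcal{L}^k_{\text{basic}}$ is
finite.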

3.3. Computing maximum disclosure efficiently 

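Having reduced our search space from sets of basic implications
that could lead to maximum disclosure to sets of
simple implications with the same consequent, we are now
in a position to create an efficient algorithm to compute the
maximum disclosure. We want to maximize
$\Pr(A \mid B \wedge (\bigwedge_{i \in [k]} (A_i \rightarrow A)))$ over all atoms $A$, $A_i$, $i \in [k]$. Notice that
for any atoms $A$, $A_i$, $i \in [k]$ such that $A$ and $\bigwedge_{i \in [k]} (A_i \rightarrow A)$
are consistent with bucketization $B$ we have:

$$\begin{aligned}
\Pr\Big(A \,\Big|\, B \wedge \bigwedge_{i \in [k]} (A_i \rightarrow A)\Big)
&= \frac{\Pr(A \wedge (\bigwedge_{i \in [k]} (A_i \rightarrow A)) \mid B)}{\Pr(\bigwedge_{i \in [k]} (A_i \rightarrow A) \mid B)} \\
&= \frac{\Pr(A \mid B)}{\Pr((\neg A \wedge (\bigwedge_{i \in [k]} \neg A_i)) \vee A \mid B)} \\
&= \frac{\Pr(A \mid B)}{\Pr(\neg A \wedge (\bigwedge_{i \in [k]} \neg A_i) \mid B) + \Pr(A \mid B)}
\end{aligned}$$

So it suffices to construct an efficient algorithm to minimize,
over all atoms $A$, $A_i$, $i \in [k]$,

$$\frac{\Pr(\neg A \wedge (\bigwedge_{i \in [k]} \neg A_i) \mid B)}{\Pr(A \mid B)} \qquad (1)$$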
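The rewriting that leads to Formula (1) can be sanity-checked numerically. The sketch below enumerates a single hypothetical bucket (the same assumed contents as in the Ed example, with placeholder names) and compares the conditional probability computed directly against the closed form; the particular antecedent atoms are arbitrary choices for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical bucket used only to exercise the identity.
PEOPLE = ["Ed", "p1", "p2", "p3", "p4"]
VALUES = ["lung cancer", "lung cancer", "flu", "flu", "mumps"]
WORLDS = [dict(zip(PEOPLE, perm)) for perm in permutations(VALUES)]

def pr(event, given=lambda w: True):
    """Pr(event | given) under the random worlds assumption."""
    consistent = [w for w in WORLDS if given(w)]
    return Fraction(sum(1 for w in consistent if event(w)), len(consistent))

# Consequent A and two antecedent atoms A_0, A_1 (arbitrary illustrative picks).
A = lambda w: w["Ed"] == "lung cancer"
antecedents = [lambda w: w["Ed"] == "flu", lambda w: w["p1"] == "mumps"]

# The conjunction of the simple implications A_i -> A.
implications = lambda w: all((not a_i(w)) or A(w) for a_i in antecedents)
# The event (not A) and (not A_0) and (not A_1).
none_hold = lambda w: not A(w) and not any(a_i(w) for a_i in antecedents)

direct = pr(A, given=implications)                 # left-hand side
closed_form = pr(A) / (pr(none_hold) + pr(A))      # last line of the derivation
ratio = pr(none_hold) / pr(A)                      # Formula (1)
```

Since the disclosure equals $1/(1 + \text{ratio})$, minimizing Formula (1) is the same as maximizing the disclosure.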



In Section 3.3.1, we show how to minimize
$\Pr(\bigwedge_{i \in [k]} \neg A_i \mid B)$ over atoms $A_i$ involving individuals
in the same bucket. We use this in Section 3.3.2 to
provide a dynamic programming algorithm MINIMIZE1
that minimizes Formula (1) over atoms $A$, $A_i$, $i \in [k]$
involving individuals in the same bucket. Finally, in
Section 3.3.3, we use MINIMIZE1 to construct another
dynamic programming algorithm MINIMIZE2 to minimize
Formula (1) jointly over the entire bucketization.

3.3.1 Minimizing $\Pr(\bigwedge_{i \in [k]} \neg A_i \mid B)$ for one bucket

Consider all sets of $k$ atoms involving people whose tuples
are in a single $b \in B$. Each set of $k$ atoms is associated
with a tuple $(l, k_0, \ldots, k_{l-1})$, where $l$ is the number
of people involved in the $k$ atoms, and $k_i$ is the number
of atoms involving the $i$-th person. We label the $k$
atoms $A_{i,j}$ for $i \in [l]$ and $j \in [k_i]$ such that atom $A_{i,j}$
is the $j$-th atom (out of $k_i$ atoms) involving the $i$-th person.
Lemma 12 provides a closed form for the minimum value
of $\Pr(\bigwedge_{i \in [k]} \neg A_i \mid B)$ over all sets of $k$ atoms associated with
a particular $(l, k_0, \ldots, k_{l-1})$.

Algorithm 1: MINIMIZE1($b$, $i$, $\bar{k}_i$, $k$)

Input: $b$ is the bucket under consideration
Input: $i$ is the index of the next person $p_i$ for which $k_i$ (i.e., the number
    of atoms involving person $p_i$) is to be determined (initially 0)
Input: $\bar{k}_i$ is the upper bound for $k_i$ (initially $k$)
Input: $k$ is the number of atoms for which the people involved have yet to
    be determined (initially $k$)
1: $p_{\min} \leftarrow 1$
2: for $k_i = 1, 2, \ldots, \min(\bar{k}_i, k)$ do
3:     $p \leftarrow$ MINIMIZE1($b$, $i + 1$, $k_i$, $k - k_i$)
4:     $p \leftarrow \dfrac{n_b - i - \sum_{j \in [k_i]} n_b(s_b^j)}{n_b - i} \times p$
5:     $p_{\min} \leftarrow \min(p_{\min}, p)$
6: end for
7: return $p_{\min}$



Lemma 12 Let $b \in B$ be any bucket. Let $k$, $l$, and $k_0, k_1, \ldots, k_{l-1}$
be such that $k = \sum_{i \in [l]} k_i$ and $k_i \geq k_{i+1}$ for
all $i \in [l - 1]$. Let $s_b^0, s_b^1, s_b^2, \ldots$ be the sensitive values
arranged in descending order of frequency in $b$. Then
$\Pr(\bigwedge_{i \in [l],\, j \in [k_i]} \neg A_{i,j} \mid B)$ is minimized over all atoms $A_{i,j}$
when $A_{i,j}$ is $t_{p_i}[S] = s_b^j$, for all $i \in [l]$ and all $j \in [k_i]$,
where $p_0, p_1, \ldots, p_{l-1} \in P_b$ are distinct. Consequently, the
minimum probability is given by:

$$\prod_{i \in [l]} \frac{n_b - i - \sum_{j \in [k_i]} n_b(s_b^j)}{n_b - i} \qquad (2)$$

Note that $l \leq k$ and $k = \sum_{i \in [l]} k_i$ since each atom involves
exactly one person. So the question of minimizing
$\Pr(\bigwedge_{i \in [k]} \neg A_i \mid B)$ over all atoms $A_i$ that mention only tuples
in $b$ becomes one of minimizing the product in Formula (2)
over all $l \leq k$ and all $k_0, \ldots, k_{l-1}$ such that $\sum_{i \in [l]} k_i = k$.
This can easily be done using Algorithm 1. Thus, calling
MINIMIZE1($b$, 0, $k$, $k$) minimizes $\Pr(\bigwedge_{i \in [k]} \neg A_i \mid B)$ over
all atoms $A_i$ that involve people with tuples in bucket $b$. It is
easy to modify the algorithm to remember the minimizing
values of $k_0, \ldots, k_{l-1}$, and thus we can even reconstruct the
set of minimizing atoms according to Lemma 12.
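The recursion of Algorithm 1 together with Formula (2) can be sketched as a memoized function. This is an illustrative reading rather than the authors' implementation: the function name `minimize1`, the representation of a bucket as a list of value frequencies, and the handling of edge cases (returning `None` when no people are left, clamping a negative numerator to zero) are assumptions added here.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import accumulate

def minimize1(counts, k):
    """Minimum of Pr(none of k atoms holds | B) for atoms inside one bucket.

    counts: frequencies n_b(s) of the sensitive values in the bucket.
    Returns a Fraction, or None if the k atoms cannot be placed (assumption).
    """
    freq = sorted(counts, reverse=True)
    n = sum(freq)
    # prefix[m] = total frequency of the m most frequent values; values beyond
    # the bucket's domain are treated as having frequency 0.
    prefix = [0] + list(accumulate(freq + [0] * k))

    @lru_cache(maxsize=None)
    def rec(i, cap, remaining):
        # i: index of the next person; cap: upper bound on k_i;
        # remaining: atoms whose person has not been fixed yet.
        if remaining == 0:
            return Fraction(1)
        if i >= n:
            return None  # no person left to carry the remaining atoms
        best = None
        for k_i in range(1, min(cap, remaining) + 1):
            tail = rec(i + 1, k_i, remaining - k_i)
            if tail is None:
                continue
            numerator = max(0, n - i - prefix[k_i])  # one factor of Formula (2)
            p = Fraction(numerator, n - i) * tail
            best = p if best is None else min(best, p)
        return best

    return rec(0, k, k)

# Hypothetical bucket with frequencies 2, 2, 1 (as in the earlier Ed example).
one_atom = minimize1([2, 2, 1], 1)
two_atoms = minimize1([2, 2, 1], 2)
```

For the assumed bucket, two atoms are best spent on one person and the two most frequent values, which leaves only the rarest value possible for that person.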

Algorithm complexity. Note that the parameters of
MINIMIZE1 are bounded. That is, for every recursive call
MINIMIZE1($b$, $i$, $\bar{k}_i$, $k$) that occurs inside the initial call to
MINIMIZE1($b$, 0, $k$, $k$), parameter $b$ does not change, and
parameters $i$, $\bar{k}_i$ (the upper bound for $k_i$), and $k$ are all bounded
by $k$ (i.e., the number of implications we allow the attacker to know).
So we can easily turn this into an $O(k^3)$ time and space algorithm using
dynamic programming.



3.3.2 Minimizing Formula (1) within one bucket 

Let us now minimize $\frac{\Pr(\neg A \wedge (\bigwedge_{i \in [k]} \neg A_i) \mid B)}{\Pr(A \mid B)}$ over all $k + 1$
atoms $A$ and $A_i$, for $i \in [k]$, that only mention tuples in
bucket $b$. Clearly any $A$, $A_i$ that simultaneously minimize
the numerator and maximize the denominator will work.
We know that MINIMIZE1($b$, 0, $k + 1$, $k + 1$) will minimize
the numerator. According to Lemma 12, at least one
of these minimal $k + 1$ atoms mentions the most frequent
sensitive value. So, taking this atom to be $A$, we maximize
the denominator as well. Thus, the minimum value is

$$\text{MINIMIZE1}(b, 0, k + 1, k + 1) \times \frac{n_b}{n_b(s_b^0)}.$$

3.3.3 Minimizing Formula (1) over all buckets 

We look again at minimizing $\frac{\Pr(\neg A \wedge (\bigwedge_{i \in [k]} \neg A_i) \mid B)}{\Pr(A \mid B)}$, except
this time, we allow $A$ and $A_i$ for $i \in [k]$ to mention tuples
in possibly different buckets. To do this, we make use of
the independence between buckets. Suppose that the $k + 1$
minimizing atoms (including $A$) are such that $k_i$ of them
mention tuples in bucket $b_i$, for each $i \in [l]$ for some
$l \leq k + 1$. Let $b_j$ be the bucket containing the tuple mentioned
by $A$. Then, since the permutation of sensitive values for
each bucket was picked independently, we can compute the
minimum as

$$\frac{n_{b_j}}{n_{b_j}(s_{b_j}^0)} \times \prod_{i \in [l]} \text{MINIMIZE1}(b_i, 0, k_i, k_i).$$

So we need to minimize the above for all choices of
$l \leq k + 1$, $j$, and $k_0, k_1, \ldots, k_{l-1}$ (which we can assume without
loss of generality to be in descending order). Assuming
buckets in $B$ are labeled as $b_0, b_1, b_2, \ldots$, this is done by the
algorithm MINIMIZE2.

So MINIMIZE2(0, $k$, false) minimizes
Formula (1) over all atoms $A$, $A_i$, $i \in [k]$. It
is easy to modify the algorithm to remember the $i$'s and
$h_i$'s, and hence reconstruct the minimizing atoms.

Algorithm complexity. Note that the parameters of
MINIMIZE2 are bounded. That is, for every recursive call
to MINIMIZE2($i$, $h_i$, $a$) that occurs inside the initial call
to MINIMIZE2(0, $k$, false), parameter $i$ is bounded by the
number of buckets, parameter $h_i$ is bounded by the total
number of implications $k$, and $a$ is either true or false.
Thus, assuming that we first memoize (i.e., precompute
all possible calls to) MINIMIZE1 (which we can do in
time $O(|B| \times k^3)$), we can modify the MINIMIZE2 algorithm
using dynamic programming to take an additional
$O(|B| \times k)$ time and space. So the whole algorithm can be
made to run in $O(|B| \times k^3)$ time and space.

Incidentally, if one had two bucketizations B and B* that 
differed only in that B* was the result of removing some 



Algorithm 2: MINIMIZE2(i, h_i, a)
Input: i is the current bucket b_i (initially 0)
Input: h_i is the number of atoms A_j, j ∈ [k], that we have yet to
       determine (initially k)
Input: a is a flag representing whether atom A involves a person in an
       earlier bucket b_j, j < i (initially false)

if i = |B| then
    // Finished all buckets
    return r_min
end if
for h_{i+1} = 0, 1, 2, ..., h_i do
    u ← MINIMIZE1(b_i, ∅, h_{i+1}, h_{i+1})
    x ← MINIMIZE2(i + 1, h_i − h_{i+1}, true)
    if a = false then
        // Atom A does not involve an earlier bucket b_j, j < i
        // So either A involves b_i ...
        v ← MINIMIZE1(b_i, ∅, h_{i+1} + 1, h_{i+1} + 1)
        r_min ← min(r_min, v × x × 1/Pr(A | B))
        // ... or else A involves a later bucket b_j, j > i
        r_min ← min(r_min, u × MINIMIZE2(i + 1, h_i − h_{i+1}, false))
    else
        // Atom A involves an earlier bucket b_j, j < i
        r_min ← min(r_min, u × x)
    end if
end for
return r_min



buckets from B and adding x new buckets to B, then, after
we run the algorithm for B, we memoize MINIMIZE1 for
the x new buckets; so the incremental cost of running the
algorithm for B* is 0(|S*| x fc -|- a; x fc^)-time. Moreover, 
if one knew in advance which buckets were going to be
removed, one could order the buckets b_0, b_1, ... appropriately
to reuse much of the memoization of MINIMIZE2 as well.

3.4. Finding a safe bucketization 

Armed with a method to compute the maximum disclosure,
we now show how to efficiently find a "minimally sanitized"
bucketization for which maximum disclosure is below
a given threshold. Intuitively, we would like a minimal
sanitization in order to preserve the utility of the published
data. Let us be more concrete about the notion of minimal
sanitization. Given a table, consider the set of bucketizations
of this table. We impose a partial ordering ⪯ on this
set of bucketizations where B ⪯ B' if and only if every
bucket in B' is the union of one or more buckets in B. Thus
the bucketization B_⊤ that has all the tuples in one bucket is
the unique top element of this partial order, and the
bucketization B_⊥ that has one tuple per bucket is the unique bottom
element of this partial order. Our notion of a "minimally
sanitized" bucketization is one that is as low as possible in
the partial order (i.e., as close to B_⊥) while still having
maximum disclosure lower than a specified threshold.
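The partial order on bucketizations can be checked mechanically. The sketch below is illustrative only; it assumes a bucketization is represented as a collection of disjoint buckets, each a collection of hashable tuple identifiers, and the name `refines` is ours.

```python
from typing import Hashable, Iterable


def refines(b: Iterable[Iterable[Hashable]],
            b_prime: Iterable[Iterable[Hashable]]) -> bool:
    """Return True iff every bucket of b_prime is a union of buckets of b.

    Assumes each argument is a partition (disjoint, non-empty buckets).
    """
    fine = [frozenset(bucket) for bucket in b]
    coarse = [frozenset(bucket) for bucket in b_prime]
    # Both bucketizations must cover the same set of tuples.
    if set().union(*fine) != set().union(*coarse):
        return False
    # For partitions, it suffices that each fine bucket sits inside
    # some coarse bucket; each coarse bucket is then a union of fine ones.
    return all(any(f <= c for c in coarse) for f in fine)


# Toy usage with four tuples identified by integers.
bottom = [[1], [2], [3], [4]]      # one tuple per bucket
middle = [[1, 2], [3, 4]]
top = [[1, 2, 3, 4]]               # all tuples in one bucket
crossing = [[1, 3], [2, 4]]        # incomparable with `middle`
```

Here `bottom` refines every other bucketization and every bucketization refines `top`, while `middle` and `crossing` are incomparable, which is why the ordering is only partial.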

Definition 13 ((c, k)-safety) Given a threshold c ∈ [0, 1],
we say that B is a (c, k)-safe bucketization if the maximum
disclosure of B with respect to $\mathcal{L}^k_{\text{basic}}$ is less than c.

If the maximum disclosure is monotonic with respect to
the partial ordering ⪯, then finding a ⪯-minimal (c, k)-safe
bucketization can be done in time logarithmic in the height
of the bucketization lattice (which is at most the number of
tuples in the table) by doing a binary search. The following
theorem says that we do indeed have monotonicity.

Theorem 14 (Monotonicity) Let B and B' be bucketizations
such that B ⪯ B'. Then the maximum disclosure of
B is at least as high as the maximum disclosure of B' with
respect to $\mathcal{L}^k_{\text{basic}}$.
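Monotonicity is what makes a binary search valid: along any chain from the finest to the coarsest bucketization, once a bucketization is safe, everything above it is safe too. The following sketch assumes the caller supplies such a chain and a safety predicate; `lowest_safe` and `is_safe` are illustrative names, and the disclosure values in the usage lines are made up.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def lowest_safe(chain: Sequence[T],
                is_safe: Callable[[T], bool]) -> Optional[int]:
    """Index of the lowest safe element of a chain, or None if none is safe.

    chain[0] is the finest bucketization and chain[-1] the coarsest;
    is_safe is assumed monotone (safe at i implies safe at every j > i).
    """
    if not chain or not is_safe(chain[-1]):
        return None
    lo, hi = 0, len(chain) - 1  # invariant: chain[hi] is safe
    while lo < hi:
        mid = (lo + hi) // 2
        if is_safe(chain[mid]):
            hi = mid        # a safe element exists at or below mid
        else:
            lo = mid + 1    # everything at or below mid is unsafe
    return lo


# Toy usage: each entry stands for the maximum disclosure of one
# bucketization on the chain (non-increasing, per monotonicity).
disclosures = [1.0, 0.9, 0.7, 0.5, 0.3]
c = 0.6
index = lowest_safe(disclosures, lambda d: d < c)
```

The result is the lowest safe element on the supplied chain; the number of safety checks grows logarithmically with the chain length.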

Another approach is to find all ⪯-minimal (c, k)-safe
bucketizations, and return the one that maximizes a specified
utility function. The monotonicity property allows us to
make use of existing algorithms for efficient itemset mining
[4], k-anonymity [7, 22] and ℓ-diversity [24].* For example,
we can modify the Incognito [22] algorithm, which finds all
the ⪯-minimal k-anonymous bucketizations, by simply
replacing the check for k-anonymity with the check for
(c, k)-safety from Section 3.3. We can thus find the bucketization
that maximizes a given utility function subject to the
constraint that the bucketization be (c, k)-safe.
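To illustrate how monotonicity prunes such a search, here is a deliberately simplified bottom-up sweep over a generalization lattice. It is not Incognito itself; it assumes each bucketization is identified by a vector of per-attribute generalization levels (higher means coarser), that the caller supplies a monotone `is_safe` predicate, and that the lattice is small enough to enumerate. All names are illustrative.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Node = Tuple[int, ...]


def minimal_safe_nodes(max_levels: Sequence[int],
                       is_safe: Callable[[Node], bool]) -> List[Node]:
    """Return all minimal safe nodes of a generalization lattice.

    max_levels[i] is the highest level of attribute i; is_safe is assumed
    monotone with respect to the componentwise order on level vectors.
    """
    # Visit nodes by height so every node below the current one comes first.
    nodes = sorted(product(*(range(m + 1) for m in max_levels)), key=sum)
    minimal: List[Node] = []
    for node in nodes:
        # A node above a known safe node is safe but not minimal: skip the check.
        if any(all(x >= y for x, y in zip(node, m)) for m in minimal):
            continue
        if is_safe(node):
            minimal.append(node)
    return minimal


# Toy usage: two attributes with levels 0..2 and 0..1, and a made-up
# monotone predicate standing in for the (c, k)-safety check.
found = minimal_safe_nodes((2, 1), lambda n: n[0] + 2 * n[1] >= 2)
```

A utility function can then be maximized over the returned nodes, since any safe node that is not minimal generalizes the data more than necessary.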

4. Experiments 

In this section, we present a case-study of our framework 
for worst-case disclosure using the Adult Database from the 
UCI Machine Learning Repository [27]. We only consider 
the projection of the Adult Database onto five attributes - 
Age, Marital Status, Race, Gender and Occupation. The 
dataset has 45,222 tuples after removing tuples with miss- 
ing values. We treat Occupation as the sensitive attribute; 
its domain consists of fourteen values. We use pre-defined 
generalization hierarchies for the attributes similar to the 
ones used in [22]. Age can be generalized to six levels (un- 
suppressed, generalized to intervals of size 5, 10, 20, 40, 
or completely suppressed). Marital Status can be general- 
ized to three levels, and Race and Gender can each either be 
left as is or be completely suppressed. We consider all the 
possible anonymized tables using those generalizations. 
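As a quick check on the size of this search space, the sketch below enumerates one level per attribute from the hierarchies just described. It assumes each anonymized table corresponds to exactly one combination of per-attribute levels; the dictionary name is illustrative.

```python
from itertools import product

# Number of generalization levels per attribute, as described above.
LEVELS = {"Age": 6, "Marital Status": 3, "Race": 2, "Gender": 2}

# One node per combination of levels: 6 * 3 * 2 * 2 = 72 candidate tables.
nodes = list(product(*(range(n) for n in LEVELS.values())))
num_tables = len(nodes)
```

Under that assumption there are 72 candidate anonymized tables, small enough to evaluate exhaustively.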

We computed the maximum disclosure for k pieces of
background knowledge, for k ranging from 0 (i.e., no
background knowledge) to 12 (since we know that maximum
disclosure certainly reaches 1 at k = 13 because there are
only fourteen possible sensitive values). Figure 5 plots, for
one anonymized table, the number of pieces of knowledge
available to an adversary against the maximum disclosure
for both negated atoms (ℓ-diversity) and basic implications.
In the anonymized table used, all the attributes other than
Age were suppressed and the Age attribute was generalized
to intervals of size 20. The solid line corresponds to
implication statements and the dotted line corresponds to
negated atoms. This graph agrees with our earlier observation
that implication-type background knowledge subsumes
negation; the maximum disclosure for k negated atoms is
always smaller than the maximum disclosure for k
implications. However, note that, for a given k, the difference
between the maximum disclosure for negated atoms and
for basic implications is not too large. This means that an
anonymized table which tolerates maximum disclosure due
to k negated atoms need not be anonymized much further to
defend against k implications.

*While these algorithms typically have worst-case exponential running
time in the height of the bucketization lattice, they have been shown to run
fast in practice.

Intuitively, if all the buckets in a table have a nearly
uniform distribution, then the maximum disclosure should be
lower, but the exact relationship is not obvious. To get a
better picture, we performed the following experiment. We
fixed a value k for the number of pieces of information.
For every entropy value h, we looked at the set $\mathcal{T}(h)$
of all tables for which the minimum entropy of the sensitive
attribute over all buckets was equal to h. Amongst $\mathcal{T}(h)$
we found the table T(h) with the least maximum disclosure for k
implications. Let the worst-case disclosure for T(h) given k
pieces of knowledge be denoted by w(T(h), k). We plotted
h versus w(T(h), k) for k = 1, 3, 5, 7, 9, 11 in Figure 6. We
see a behavior which matches our intuition. For a given k,
the disclosure risk decreases monotonically as h increases.
This is because increasing h means that we are looking
at tables with more and more entropy in their buckets (and,
consequently, less skew). We plotted an analogous graph
(which we do not show here) for negation statements and
observed very similar behavior.
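The entropy statistic used to group tables in this experiment can be computed as in the sketch below. It assumes each bucket is given as a list of sensitive values and uses base-2 logarithms; the base is our assumption, since the text does not state one, and the function names are illustrative.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2, an assumption) of the values in one bucket."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def min_bucket_entropy(buckets: Sequence[Sequence[Hashable]]) -> float:
    """Minimum entropy of the sensitive attribute over all buckets."""
    return min(entropy(bucket) for bucket in buckets)


# Toy usage: one evenly split bucket and one more varied bucket.
table = [["a", "b"], ["a", "a", "b", "c"]]
h = min_bucket_entropy(table)
```

A bucket holding a single repeated value has entropy 0, which corresponds to the homogeneous, most skewed case.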

5. Related Work 

Many metrics have been proposed to quantify privacy
guarantees in publishing anonymized data-sets.
'Perfect privacy' [12, 26] guarantees that published data
does not disclose any information about the sensitive data.
However, checking whether a conjunctive query discloses
any information about the answer to another conjunctive
query is shown to be very hard ($\Pi_2^p$-complete [26]).
Subsequent work showed that checking for perfect privacy can be
done efficiently for many subclasses of conjunctive queries
[23]. Perfect privacy places very strong restrictions on the
types of queries that can be answered [26] (in particular,
aggregate statistics cannot be published). Less restrictive
privacy definitions based on asymptotic conditional
probabilities [11] and certain answers [30] have been proposed.
Statistical databases allow answering aggregates over
sensitive values without disclosing the exact value [1].
De-identification, like k-anonymity [28, 32] and "blending in
a crowd" [8], ensures that an individual cannot be associated
with a unique tuple in an anonymized table. However,
under both of those definitions, sensitive information can be
disclosed if groups are homogeneous.

Figure 5. Disclosure vs # pieces of background knowledge

Background knowledge can lead to disclosure of sensitive
information. Su et al. [31] and Yang et al. [35] limit
disclosure when functional dependencies in the data are known
to the data publisher upfront. The notion of ℓ-diversity [24]
guards against limited amounts of background knowledge
unknown to the data publisher. Farkas et al. [16] provide a
survey of indirect data disclosure via inference channels.

There are several approaches to anonymizing a dataset 
to ensure privacy. These include generalizations [7, 22, 29], 
cell and tuple suppression [9, 29], adding noise [1, 5, 8, 15],
publishing marginals that satisfy a safety range [14], and
data swapping [10], where attributes are swapped between 
tuples so that certain marginal totals are preserved. Queries 
can be posed online and the answers audited [20] or per- 
turbed [13]. Not all approaches guarantee privacy. For ex- 
ample, spectral techniques can separate much of the noise 
from the data if the noise is uncorrelated with the data 
[17, 19]. Anatomy [34] is a recently proposed anonymiza- 
tion technique that corresponds exactly to the notion of 
bucketization that we use in this paper. When the attacker 
knows full identification information, then generalization 
provides no more privacy than bucketization. However, 
we recommend generalizing the attributes before publishing 
the data since this will prevent attackers that do not already 
have full identification information from reidentifying indi- 
viduals via linking attacks [32]. In many cases, the fact that 
a particular individual is in the table is considered sensitive 
information [8]. 

The utility of data that has been altered to preserve
privacy has often been studied for specific future uses of the
data. Work has been done on preserving association rules
while adding noise [15]; reconstructing distributions of
continuous variables after adding noise with a known
distribution [5, 3]; reconstructing data clusters after perturbing
numeric attributes [8]; and maximizing decision tree
accuracy while anonymizing data [18, 33]. There have also
been some negative results for utility. Publishing a single
k-anonymous table can suffer from the curse of
dimensionality [2]: large portions of the data need to be suppressed to
ensure privacy. Subsequent work [21] shows how to publish
several tables instead of a single one to combat this.

Figure 6. Entropy vs Maximum Disclosure Risk

6. Conclusions 

In this paper, we initiate a formal study of the worst-case
disclosure with background knowledge. Our analysis
does not assume that we are aware of the exact background
knowledge possessed by the attacker. We only assume
bounds on the attacker's background knowledge
in terms of the number of basic units of knowledge that
the attacker possesses. We propose basic implications as an
expressive choice for these units of knowledge. Although
computing the probability of a specific disclosure from a
given set of k basic implications is intractable, we show how
to efficiently determine the worst case over all sets of k basic
implications. In addition, we show how to search for a
bucketization that is robust (to a desired threshold c) against
any k basic implications by combining our check for
(c, k)-safety with existing lattice-search algorithms. Finally, we
demonstrate that, in practice, ℓ-diversity has similar maximum
disclosure to our notion of (c, k)-safety, which guards
against a richer class of background knowledge.

Since we chose basic implications as our units of knowledge,
our algorithms will clearly yield very conservative
bucketizations if we try to protect against an attacker who 
knows information that can only be expressed using a large 
number of basic implications. One way to reduce the num- 
ber of basic units required is to add more powerful atoms to 
our existing language. Finding the right language for basic 
units of knowledge is an important direction of future work. 

Other directions for future work include extending our
framework to probabilistic background knowledge, studying
cost-based disclosure (since it was observed in [24]
that not all disclosures are equally bad), and extending
our results to other forms of anonymization, such as
data-swapping and collections of anonymized marginals [21].